This work studies how to efficiently tailor Convolutional Neural Networks (CNNs) towards learning timbre representations from log-mel magnitude spectrograms. We first review the trends in CNN architecture design. Through this literature overview, we discuss the crucial points to consider for efficiently learning timbre representations using CNNs. From this discussion, we propose a design strategy meant to capture the relevant time-frequency contexts for learning timbre, which permits using domain knowledge for designing architectures. In addition, one of our main goals is to design efficient CNN architectures, which reduces the risk of these models over-fitting, since the number of CNN parameters is minimized. Several architectures based on the design principles we propose are successfully assessed for different research tasks related to timbre: singing voice phoneme classification, musical instrument recognition and music auto-tagging.